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Abstract We study synaptic plasticity in a complex neuronal cell model where 
NMDA-spikes can arise in certain dendritic zones. In the context of reinforcement 
learning, two kinds of plasticity rules are derived, zone reinforcement (ZR) and cell 
reinforcement (CR), which both optimize the expected reward by stochastic gradient 
ascent. For ZR, the synaptic plasticity response to the external reward signal is mod- 
ulated exclusively by quantities which are local to the NMDA- spike initiation zone 
in which the synapse is situated. CR, in addition, uses nonlocal feedback from the 
soma of the cell, provided by mechanisms such as the backpropagating action poten- 
tial. Simulation results show that, compared to ZR, the use of nonlocal feedback in 
CR can drastically enhance learning performance. We suggest that the availability of 
nonlocal feedback for learning is a key advantage of complex neurons over networks 
of simple point neurons, which have previously been found to be largely equivalent 
with regard to computational capability. 

Keywords Dendritic computation • reinforcement learning • spiking neuron 
1 Introduction 

Except for biologically detailed modeling studies, the overwhelming majority of 
works in mathematical neuroscience have treated neurons as point neurons, i.e., a 
linear aggregation of synaptic input followed by a nonlinearity in the generation of 
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somatic action potentials was assumed to characterize a neuron. This disregards the 
fact that many neurons in the brain have complex dendritic arborization where synap- 
tic inputs may be aggregated in highly nonlinear ways [1]. From an information pro- 
cessing perspective sticking with the minimal point neuron may nevertheless seem 
justified since networks of such simple neurons already display remarkable computa- 
tional properties: assuming infinite precision and noiseless arithmetic a suitable net- 
work of spiking point neurons can simulate a universal Turing machine and, further, 
impressive information processing capabilities persist when one makes more realis- 
tic assumptions such as taking noise into account (see [2] and the references therein). 
Such generic observations are underscored by the detailed compartmental model- 
ing of the computation performed in a hippocampal pyramidal cell [3]. There it was 
found that (in a rate coding framework) the input-output behavior of the complex cell 
is easily emulated by a simple two layer network of point neurons. 

If the computations of complex cells are readily emulated by relatively simple cir- 
cuits of point neurons, the question arises why so many of the neurons in the brain 
are complex. Of course, the reason for this may be only loosely related to informa- 
tion processing proper, it might be that maintaining a complex cell is metabolically 
less costly than the maintenance of the equivalent network of point neurons. Here, 
we wish to explore a different hypothesis, namely that complex cells have crucial ad- 
vantages with regard to learning. This hypothesis is motivated by the fact that many 
artificial intelligence algorithms for neural networks assume that synaptic plasticity 
is modulated by information which arises far downstream of the synapse. A promi- 
nent example is the backpropagation algorithm where error information needs to be 
transported upstream via the transpose of the connectivity matrix. But in real axons 
any fast information flow is strictly downstream, and this is why algorithms such 
as backpropagation are widely regarded as a biologically unrealistic for networks of 
point neurons. When one considers complex cells, however, it seems far more plau- 
sible that synaptic plasticity could be modulated by events which arise relatively far 
downstream of the synapse. The backpropagating action potential, for instance, is of- 
ten capable of conveying information on somatic spiking to synapses which are quite 
distal in the dendritic tree [4, 5]. If nonlinear processing occurred in the dendritic tree 
during the forward propagation, this means that somatic spiking can modulate synap- 
tic plasticity even when one or more layers of nonlinearities lie between the synapse 
and the soma. Thus, compared to networks of point neurons, more sophisticated plas- 
ticity rules could be biologically feasible in complex cells. 

To study this issue, we formalize a complex cell as a two layer network, with the 
first layer made up of initiation zones for NMDA-spikes (Figure 1). NMDA- spikes 
are regenerative events, caused by AMPA mediated synaptic releases when the re- 
leases are both near coincident in time and spatially co-located on the dendrite [6-8]. 
Such NMDA-spikes boost the effect of the synaptic releases, leading to increases 
in the somatic potential which are stronger as well as longer compared to the effect 
obtained from a simple linear superposition of the excitatory post synaptic poten- 
tials from the individual AMPA releases. Further, we assume that the contribution 
of NMDA-spikes from different initiation zones combine additively in contributing 
to the somatic potential and that this potential governs the generation of somatic ac- 
tion potentials via an escape noise process. While we would argue that this provides 
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an adequate minimal model of dendritic computation in basal dendritic structures, 
one should bear in mind that our model seems insufficient to describe the complex 
interactions of basal and apical dendritic inputs in cortical pyramidal cells [9, 10]. 

We will consider synaptic plasticity in the context of reinforcement learning, 
where the somatic action potentials control the delivery of an external reward signal. 
The goal of learning is to adjust the strength of the synaptic releases (the synaptic 
weights) so as to maximize the expected value of the reward signal. In this frame- 
work, one can mathematically derive plasticity rules [1 1, 12] by assuming that weight 
adaption follows a stochastic gradient ascent procedure in the expected reward [13]. 
Dopamine is widely believed to be the most important neurotransmitter for such re- 
ward modulated plasticity [14-16]. A simple minded application of the approach in 
[13] leads to a learning rule where, except for the external reward signal, plastic- 
ity is determined by quantities which are local to each NMDA- spike initiation zone 
(NMDA-zone). Using this rule, NMDA-zones learn as independent agents which are 
oblivious of their interaction in generating somatic action potentials, with the exter- 
nal reward signal being the only mechanism for coordinating plasticity between the 
zones, hence we shall refer to this rule as zone reinforcement (ZR). Due to its simplic- 
ity, ZR would seem biologically feasible even if the network were not integrated into 
a single neuron. On the other hand, this approach to multi-agent reinforcement often 
leads to a learning performance which deteriorates quickly as the number of agents 
(here, NMDA-zones) increases since it lacks an explicit mechanism for differentially 
assigning credit to the agents [17, 18]. By algebraic manipulation of the gradient for- 
mula leading to the basic ZR-rule, we derive a class of learning rules where synaptic 
plasticity is also modulated by somatic responses, in addition to reward and quantities 
local to the NMDA-zone. Such learning rules will be referred to as cell reinforcement 
(CR), since they would be biologically unrealistic if the nonlinearities where not in- 
tegrated into a single cell. We present simulation result showing that one rule in the 
CR-class results in learning which is much faster than for the ZR-rule. This provides 
evidence for the hypothesis that enabling effective synaptic plasticity rules may be 
one evolutionary advantage conveyed by dendritic nonlinearities. 
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2 Stochastic cell model of a neuron 

We assume a neuron with N = 40 initiation zones for NMDA- spikes, indexed by v = 
1, . . . , N. An NMDA-zone is made up of M v synapses, with synaptic strength Wi fV 
(i = 1, . . . , My), where releases are triggered by presynaptic spikes. We denote by 
Xi }V the set of times when presynaptic spikes arrive at synapse (/, v). In each NMDA- 
zone, the synaptic releases give rise to a time varying local membrane potential u v 
which we assume to be given by a standard spike response equation 

My 

u v (t;X) = U re st + J2 Wi > v e(t-s). (1) 

i seXi^ 

Here, X denotes the entire presynaptic input pattern of the neuron, £/ rest = — 1 (ar- 
bitrary units) is the resting potential, and the postsynaptic response kernel € is given 
by 

e (t)=-®yL(e- t /*-- e -« T '). 

We use x m — 10 ms for the membrane time constant, t s — 1.5 ms for the synaptic rise 
time, and 0 is the Heaviside step function. 

The local potential u v controls the rate at which what we call NMDA-events are 
generated in the zone - in our model NMDA-events are closely related to the onset of 
NMDA-spikes as described in detail below. Formally, we assume that NMDA-events 
are generated by an inhomogeneous Poisson process with rate function (p^(u v (t; X)), 
choosing 

(fo(x) = qsefo x (2) 

with q N = 0.005 and /3n = 3. We adopt the symbol Y v to denote the set of NMDA- 
event times in zone v. For future use, we recall the standard result [19] that the prob- 
ability density P w v (Y v |X) of an event- train Y v generated during an observation pe- 
riod running from t = 0 to T satisfies 

log/V v (r|X) = [ T dt log(q N e^ u ^ x) )y v (t) - q N e^ u ^ x \ (3) 
Jo 

where y v (t) = Y2 seY v ~ s ^ ^ s m e 5-function representation of Y v . 

Conceptually, it would be simplest to assume that each NMDA-event initiates a 
NMDA-spike. But we need some mechanism for refractoriness, since NMDA-spikes 
have an extended duration (20-200 ms) and there is no evidence that multiple simul- 
taneous NMDA-spikes can arise in a single NMDA-zone. Hence, we shall assume 
that, while a NMDA-event occurring in temporal isolation causes a NMDA-spike, a 
rapid succession of NMDA-events within one zone only leads to a somewhat longer 
but not to a stronger NMDA-spike. In particular, we will assume that a NMDA-spike 
contributes to the somatic potential during a period of A = 50 ms after the time of 
the last preceding NMDA-event. Hence, if a NMDA-event is followed by a second 
one with a 5 ms delay, the first event initiates a NMDA-spike which lasts for 55 ms 
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due to the second NMDA-event. Formally, we denote by sy v (t) = max{s <t\seY v ) 
the time of the last NMDA-event up to time t and model the somatic effect of an 
NMDA-spike by the response kernel 

= { . (4) 

1 0 otherwise. 

The main motivation for modeling the generation of NMDA- spikes in this way is 
that it proves mathematically convenient in the calculations below. Having said this, 
it is worthwhile mentioning that treating NMDA- spikes as rectangular pulses seems 
reasonable, since their rise and fall times are typically short compared to the duration 
of the spike. Also, there is some evidence that increased excitatory presynaptic activ- 
ity extends the duration of a NMDA-spike but does not increase its amplitude [7, 8]. 
Qualitatively, the above model is in line with such findings. 

For specifying the somatic potential U of the neuron, we denote by Y the vector of 
all NMDA-event trains Y v and by Z the set of times when the soma generates action 
potentials. We then use 

N 

U (t; Y, Z) = [/ rest + J2aVr>(t)-J2 K(f - s) (5) 

v—\ seZ 

for the time course of the somatic potential, where the reset kernel tc is given by 

K(t) = ®(t)e- t/Tm . 

This is a highly stylized model of the somatic potential since we assume that NMDA- 
zones contribute equally to the somatic potential (with a strength controlled by the 
positive parameter a) and that, further, the AMPA-releases themselves do not con- 
tribute directly to U. Even if these restrictive assumptions may not be entirely un- 
reasonable (for instance, AMPA-releases can be much more strongly attenuated on 
their way to the soma than NMDA-spikes) we wish to point out that, while becoming 
simpler, the mathematical approach below does not rely on these restrictions. 

Somatic firing is modeled as an escape noise process with an instantaneous rate 
function 0s (U (t; Y, Z)) where 

( / )s (x)=qse^ x (6) 

with q$ = 0.005 and p$ = 5. As shown in [20], for the probability density P(Z\Y) of 
responding to the NMDA-events with a somatic spike train Z during the observation 
period this implies 

log P(Z\Y) = f At log{q^ U ^)Z{t) - q s e^ z ^ (7) 

with z{t) =j: sez Ht-s). 
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3 Reinforcement learning 

In reinforcement learning, one assumes a scalar reward function R(Z, X) providing 
feedback about the appropriateness of the somatic response Z to the input X. The goal 
of learning is to adapt the synaptic strengths so as to obtain appropriate somatic re- 
sponses. For our neuronal model, the expected value R of the reward signal R(Z, X) 
is 

R(w) = j dXdYdZ P(X)P w (Y|X)P(Z|Y)fl(Z,X), (8) 

where P(X) is the probability density of the input spike patterns and P W (Y|X) = 
riiLi v IX). The goal of learning can now be formalized as finding a w max- 
imizing R and synaptic plasticity rules can be obtained using stochastic gradient as- 
cent procedures for this task. 

In stochastic gradient ascent, X, Y, and Z are sampled at each trial and every 
weight is updated by 

m,V +- W iiV + T]gi, v (X, Y, Z), 

where r\ > 0 is the learning rate and g;, v (X, Y, Z) is an (unbiased) estimator of 
o^—R- Under mild regularity conditions, convergence to a local optimum is guar- 
anteed if one uses an appropriate schedule for decreasing r\ towards 0 during learning 
[21]. In biological modeling, one usually simply assumes a small but fixed learning 
rate. 

The derivative of R with respect to the weight of synapse (/, v) can be written 

as 

9 C 9 
R = / dXdYdZP(X)P w (Y|X)P(Z|Y)#(Z,X)- log P w . V (Y V \X). (9) 

Hence, a simple choice for the gradient estimator is 

gf*(X, Y, Z) = R(Z, X) log iV v (Y v |X) (10) 

with P w (Y v |X) given by Equation 3. Note that the conditional probability P(Z\Y) 
does not explicitly appear in the estimator, so the update is oblivious of the architec- 
ture of the model neuron, i.e., of how NMDA-events contribute to somatic spiking. 
Since the only learning mechanism for coordinating the responses of the different 
NMDA-zones is the global reward signal R(Z, X), we refer to the update given by 
Equation 10 as ZR. 

Better plasticity rules can be obtained by algebraic manipulations of Equations 8 
and 9 which yield gradient estimators which have a reduced variance compared to 
Equation 10 - this should lead to faster learning. A simple and well-known example 
for this is adjusting the reinforcement baseline by choosing a constant c and replac- 
ing R(Z,X) with R(Z,X) + c in Equation 10; this amounts to adding c to /?(w) 
and hence does not change the gradient. But a judicious choice of c can reduce the 
variance of the gradient estimator. More ambitiously, one could consider analytically 
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integrating out Y in Equation 8, yielding an estimator which directly considers the 
relationship between synaptic weights and somatic spiking because it is based on 
log P W (Z|X). While actually doing the integration analytically seems impracti- 
cal, we shall obtain estimators below from a partial realization of this program. 



4 From zone reinforcement to cell reinforcement 

Due to the algebraic symmetries of our model cell, it suffices to give explicit plasticity 
rules only for one synaptic weight. To reduce clutter we will thus focus on the first 
synapse w\^\ in the first NMDA-zone. 

4.1 Notational simplifications 

Let denote the vector (Y 2 , ... , Y N ) of all NMDA-event trains but the first and 
the collection of synaptic weights (iu. j 2,...,w.,#)inall but the first NMDA-zone. 
We rewrite the expected reward as 

r(>.,i,X,Y\) 

Since in Equation 11 only r depends on w\j we just need to consider g^j-yf. Hence, 

we can regard X and Y^ as fixed and suppress them in the notation. This allows us to 
write the somatic potential (Equation 5) simply as 

U(t; Z, Y) = U base (t; Z) +fl*y(f) (12) 

using Y as shorthand for the NMDA-event train Y 1 of the first zone and, further, 
incorporating into a time varying base potential t/base the following contributions in 
Equation 5: (i) the resting potential, (ii) the influence of Y\ i.e., NMDA-events in the 
other zones, (iii) any reset caused by somatic spiking. Similarly, the notation for the 
local membrane potential of the first NMDA-zone becomes 

u(t) 

— ^base (t) + w1r(t), (13) 

where w stands for the strength w\,\ of the first synapse, ^(t) = Y2 seX i i € ^ ~ s ^ 
and the effect of the other synapses impinging on the zone is absorbed into w D ase(0- 
Finally, the u; -dependent contribution r to the expected reward (Equation 11) can be 
written as 

r(w) = j dZdYP(Z\Y)P w (Y)R(Z), (14) 

where also for R and P w we have suppressed the dependence on X. In the reduced 
notation, the explicit expression (obtained from Equations 3 and 10) for the gradient 



= j dXdY\p(X)P w \(Y\|X)r(u;.,i,X,Y\) with 

\ (11) 
= / dZdr 1 P(Z|Y)P Wj . 1 (F 1 |X)7?(Z,X). 



Springer 



Page 8 of 19 



Schiess et al. 



estimator in ZR-learning is 

g ZR (Y, Z) = R(Z) [ T dt (y(t) - q N e^)M(t). (15) 
Jo 

4.2 Cell reinforcement 

To simplify the manipulation of Equation 14, we replace the Poisson process generat- 
ing Y by a discrete time process with step-size 8 > 0. We assume that NMDA-events 
in Y can only occur at times tk = k8 where k runs from 1 to K = IT/ 8} and intro- 
duce K independent binary random variables £ {0, 1} to record whether or not a 
NMDA-event occurred. For the probability of not having a NMDA-event at time tk 
we use 

P w (y k= 0)= e - 8 ^ (Uitk) \ (16) 

With this definition, we can recover the original Poisson process by taking the limit 
8 -> +0. We use y = (yi, . . . , v^) to denote the entire response of the NMDA-zone 
and, to make contact with the set-based description of the NMDA-trains, we denote 
by y the set of NMDA-event times in y, i.e., y = {tk\yk = !}• Next, the discrete time 
version of Equation 14 is 

r 8 (w)= f dZ^]/?(Z)P(Z|y)P Wj (y), (17) 

where P w (y) = Ylk=i Pw(yk)- ln me end, we will recover r from r§ by taking 8 to 
zero. 

The derivative of Equation 17 is 

3 C 9 

— n= / &zYP{Z\y)P w (y)R{Z)Y — \ogP w {y k ) 
dw J ^ f— ' dw 

y k=\ 

and to focus on the contributions to J-rx from each time bin we set 

aw ° 

dZ Y Pw (y)P(Z\y)R(Z)— log P w (yk). (18) 
y 

Hence, j^n = Y^ =l gmd k . 

We now exploit the trivial fact that we can think of P(Z|y) as a function linear in 
y&, simply because y& is binary. As a consequence, we can decompose P(Z|y) into 
two terms: one which depends on y& and one which does not. For this, we pick a 
scalar fi and rewrite P(Z\y) as 

P(Z|y) = a(y\*) + (y k - /x)^(y^), (19) 
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where jA* = (yi , . . . , y^-i , y k+ \ , . . . , yx) and 

a(y\*) = /xP(Z|y U {ft}) + (1 - /x)P(Z|y \ {t k }) 
P(y\ k ) = P(Z\y U {t k }) - P(Z\y \ {t k }). 

Plugging Equation 19 into Equation 18 yields grad^ as sum of two terms 

grad^ = Ak + Bk where 

/9 
dZ J2P w (j)a(j\ k )R(Z)— log P w (y k ) (20) 
^ aw 
y 

/9 
dZ V P w (y) R(Z) (y k - Li)p(y^ k ) — log P w (y k ). 
^ aw 
y 

Rearranging terms in A^, we get 

/9 
dZ V P w (j Xk )R(Z)a(y\ k ) V P„,(3fc) — log P„,Gfr). 

Now, E„ P* Gfr) 4 lQ g P - C^ifc) = 4 P - <3fc) = 4 1 = 0, hence 

A^=0 and grad^ = ^. (21) 

The two equations above encapsulate our main idea for improving on ZR. In show- 
ing that A k = 0 we summed over the two outcomes yk G {0, 1}, thus identifying a 
noise contribution in the ZR estimator R(Z) log P w (yk) for grad^ which vanishes 
through the averaging by the sampling procedure. Note that the remaining contribu- 
tion Bk has as factor fi(y\ k ), a term which explicitly reflects how a NMDA-event at 
time tk contributes to the generation of somatic action potentials. In going from Equa- 
tion 20 to Equation 21, we assumed that the parameter fi was constant. However, a 
quick perusal of the above derivation shows that this is not really necessary. For jus- 
tifying Equation 21, one just needs that fi does not depend on y k , so that a(y\ k ) is 
indeed independent of yk . In the sequel, it shall turn to be useful to introduce a value 
of fi which depends on somatic quantities. 

A drawback of Equations 20 and 21 is that they do not immediately lend them- 
selves to Monte-Carlo estimation by sampling the process generating neuronal 
events. The reason being the missing term P(Z\y) in the formula for Bk. To rein- 
troduce the term, we set 

Pj(t k ) = P(j y *)/P(Z\j) (22) 
and in view of Equations 20 and 21 have 

dZ Y^P w (y)P(Z\y)R(Z)(y k - riPyitk) — \ogP w (y k ). 
^ aw 
y 
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Hence, R(Z)(yk — /x)/3 y (^) log P w (yk) is an unbiased estimator of grad^ and, 
since grad^ gives the contribution to j^r& from the kth time step, 

k 9 
gf R = R(Z) Y(yk - ^)Py Ok)— log P w (y k ) (23) 

is an unbiased estimator of j^r&. Note that, while unavoidable, the above recasting 
of the gradient calculation as an estimation procedure does seem risky. Due to the 
division by P(Z\y) in introducing /3, Equation 22, rare somatic spike trains Z can 
potentially lead to large values of the estimator gf R . 

To obtain a CR estimator g CR for the expected reward R in our original problem, 
we now just need to take 8 to 0 in Equation 23 and tidy up a little. The detailed 
calculations are presented in Appendix 1, here we just display the final result: 

g CR (Y, Z) = R(Z) [ T dt ((1 - fiKl - e-n®)y(t) 
Jo 

P(Z\YU{t}) 

yyit) = log (24) 

6 P(Z\Y\{t}) 

"min(7V+A) 



dj(l-*r\w(j)) 



In contrast to the ZR-estimator, g CR depends on somatic quantities via yy(t) which 
assesses the effect of having a NMDA-event at time t on the probability of the ob- 
served somatic spike train. This requires the integration over the duration A of a 
NMDA-spike. 

The CR-rule can be written as the sum of two terms, a time-discrete one depending 
on the NMDA-events y, and a time-continuous one depending on the instantaneous 
NMDA-rate, both weighted by the effect of an NMDA-event on the probability of 
producing the somatic spike train: 

PR f T P(Z\YU{t})- P(Z\Y\{t}) 

■ CR (Y, Z) = (1 - fi)R(Z) I dt p\z\YU{t}) ^^(0 



P(Z\YU{t})-P(Z\Y\{t}) jNU(t) 



i P(Z\Y\{t}) H 



5 Performance of zone and cell reinforcements 

To compare the two plasticity rules, we first consider a rudimentary learning scenario 
where producing a somatic spike during a trial of duration T = 500 ms is deemed an 
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incorrect response, resulting in reward R(Z, X) = —1. The correct response is not to 
spike (Z = 0) and this results in a reward of 0. With these reward signals, synaptic 
updates become less frequent as performance improves. This compensates somewhat 
for having a constant learning rate instead of the decreasing schedule which would 
ensure proper convergence of the stochastic gradient procedure. We use a = 0.5 for 
the NMD A- spike strength in Equation 5, so that just 2-3 concurrent NMDA- spikes 
are likely to generate a somatic action potential. The input pattern X is held fixed and 
initial weight values are chosen so that correct and incorrect responses are equally 
likely before learning. Simulation details are given in Appendix 2. Given our choice 
of a and the initial weights, dendritic activity is already fairly low before learning and 
decreasing it to a very low level is all that is required for good performance in this 
simple task (Figure 2). 

Simulations for ZR and CR (with a constant value of fi = ^) are shown in 
panel 6A. Given the sophistication of the rule, the performance of CR is disappoint- 
ing, yielding on average only a modest improvement over ZR. The histogram in 
panel 6B shows that in most cases CR does in fact learn substantially faster than ZR 
but, in contrast to ZR, CR spectacularly fails on some runs. Performance in a bad run 
of the CR-rule is shown in panel 6C, revealing that performance can deteriorate in a 
single trial. In this trial, a very unlikely somatic response was observed (panel 6D), 
resulting in a large value of yy, thus leading to an excessively large change in synaptic 
strength. 

The finding that large fluctuations in the CR-estimator can arise from rare somatic 
events, confirms the suspicion in Section 4.2 that recasting Equation 20 as a sampling 
procedure can lead to problems. Luckily, this can be addressed using the additional 
degree of freedom provided by the parameter fi in the CR-rule. To dampen the effect 
of the fluctuations in yy, we set fi to the time-dependent value 

= 1 = P(Z\Y\{t}) 
M 1 + P(Z\YU{t}) + P(Z\Y\{t})' 

Note that fi is independent of whether or not t e Y. Hence, in view of our remark 
following Equation 21, this is in fact a valid choice for fi. The specific form of Equa- 
tion 25 is to some extent motivated by the aesthetic considerations. It simplifies the 
first line of Equation 24 to 

g hCR (Y, Z) = R(Z) J* dt tanh Qky(o) (3>(0 + q^ u{t) ) M it) . (26) 

We refer to this estimator as balanced cell reinforcement (bCR) (Figure 3). 

From the third line of Equation 24, one sees that the somato-dendritic interaction 

term in Equation 26 can be written as tanh(^yy(0) = p(z]rulf})+P(z|i^ • ^ ms 
highlights the terms role as assessing the relevance to the produced somatic spike 
train of having an NMDA-event at time t. In this, it is analogous to the e ±YY terms in 
the CR-rule. But in contrast to these terms, tanh(^yy) is bounded. In ZR, plasticity 
is driven by the exploration inherent in the stochasticity of NMDA-event generation. 
Formally, this is reflected by the difference y(t) — q^e^ u ^ entering as a factor 
in Equation 15, which represents the deviation of the sampled NMDA-events from 
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Fig. 2 Learning to stay quiescent. (A) Learning curves for cell reinforcement (blue) and zone reinforce- 
ment (red) when the neuron should not respond with any somatic firing to one pattern which is repeatedly 
presented. Values shown are averages over 40 runs with different initial weights and a different input 
pattern. (B) Distributions of the performance after 1500 trials. (C) A bad run of the CR-rule where per- 
formance drops dramatically after the 397th pattern presentation. The grey points show the Euclidean 
norm of the change ||AW|| in the neurons weight matrix W, highlighting the excessively large synap- 
tic update after trial 397. (D) Time course of the somatic potential during trial 397 (the straight line 
at t = 219 ms marks a somatic spike). As shown more clearly by the blow-up in the bottom row an 
NMD A- spike occurring at t* = 232 ms yields a value of U which stays strongly positive for some 10 ms. 
(U drops thereafter because a NMDA-spike in a different zone ends.) Improbably, however, the sustained 
elevated value of U after t* does not lead to a somatic spike. Hence, the likelihood of the observed so- 
matic response Z given the activity Y v in the zone v where the NMDA-spike at time t* occurred is quite 
small, P(Z[ f * ,f*+A] \Y V ) = ^(Z[r*,r*+A] \Y V U {t*}) % 0.017. Indeed, the actual somatic response would 
have been much more likely without the NMDA-spike, P(Z[ ts j s+ /^\Y v \ {t*}) ~ 0.72. The discrepancy 
between the two probabilities yields a large value of exp(— yyv (t*)) in Equation 24, leading to the strong 
weight change. Error bars in the figure show 1 SEM. 



the expected rate. In bCR, this difference has become a sum. Hence, exploration 
at the NMDA-event level is only of minor importance for the bCR-rule, where the 
essential driving force for plasticity is the somatic exploration entering through the 
factor tanh(^yy). 

Due to the modification, bCR consistently and markedly improves on ZR, as 
demonstrated by panel 5A which compares the learning curves for the same task 
as in panel 6A. The performance improvement seems to become even larger for more 
demanding tasks. This is highlighted by panel 5B showing the performance when 
not just one but four different stimulus-response associations have to be learned. For 
two of the patterns, the correct somatic response was to emit at least one spike, for 
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Fig. 3 Balanced cell reinforcement (bCR, Equation 26) compared to zone reinforcement. (A) Average 
performance of bCR (green) and ZR (red) on the same task as in panel 6A. (B) Performance when learning 
stimulus-response associations for four different patterns; bCR (green), ZR (red), a logarithmic scale is 
used for the x-axis. The inset shows the distribution of NMDA-spike durations after learning the task 
with bCR. The performance values in the figure are averages over 40 runs, and error bars show 1 SEM. 
(C) Development of the average reward signal R(Z) for bCR (green) and ZR (red) when the task is to spike 
at the mid time of the single input pattern (R(Z) = -2/(nT)J2i \t- V ~ t targ \, where ^ sp e Z, i = 1, . . . , n, 
is the ith of the n output spike times, £ targ = 250 ms the target spike time, and T = 500 ms the pattern 
duration; if there was no output spike within [0, T) we added one at T , yielding R(Z) = —1). (D) Spike 
raster plot of the output spike times Z with R(Z) shown in C using bCR. With ZR, the distribution of 
spike times after 3000 trials roughly corresponds to the one for bCR after 160 trials (vertical line at *), 
where the two performances coincide (see * and black lines in C). The mean and standard deviation of 
the spike times at the end of the learning process, averaged across the last 300 trials, was 251 ±45 and 
256 ± 121 ms for bCR and ZR, respectively. 



the other two patterns the correct response was to stay quiescent. One of the four 
stimulus-response associations was randomly chosen on each trial and, as before, cor- 
rect somatic responses lead to a reward signal of R = 0 whereas incorrect responses 
resulted in R = — 1 . The inset to panel 5B shows the distribution of NMDA-spike du- 
rations after learning the four stimulus-response associations with bCR. Over 70% of 
the NMDA-spikes last for just a little longer than the minimal length of A = 50 ms. 
Further nearly all of the spikes are shorter than 100 ms, thus staying well within a 
physiologically reasonable range. 

Panels 5C and 5D show results in a task where reward delivery is contingent on 
an appropriate temporal modulation of the firing rate. Also, in this second output 
coding paradigm, the bCR-update is found to be much more efficient in estimating 
the gradient of the expected reward. 
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6 Discussion 

We have derived a class of synaptic plasticity rules for reinforcement learning in 
a complex neuronal cell model with NMDA-mediated dendritic nonlinearities. The 
novel feature of the rules is that the plasticity response to the external reward signal 
is shaped by the interaction of global somatic quantities with variables local to the 
dendritic zone where the nonlinear response to the synaptic release arises. Simulation 
results show that such so-called CR rules can strongly enhance learning performance 
compared to the case where the plasticity response is determined just from quantities 
local to the dendritic zone. 

In the simulations, we have considered only a very simple task with a single 
complex cell learning stimulus-response associations. The results, however, show 
that compared to ZR the bCR rule provides a less noisy procedure for estimating 
the gradient of the log-likelihood of the somatic response given the neuronal input 
(gj— log P W (Z|X)). Estimating this gradient for each neuron is also the key step for 
reinforcement learning in networks of complex cells [13]. Further, simply memoriz- 
ing the gradient estimator with an eligibility trace until reward information becomes 
available, yields a learning procedure for partially observable Markov decision pro- 
cesses, i.e., tasks where the somatic response may have an influence on which stim- 
uli are subsequently encountered and where reward delivery may be contingent on 
producing a sequence of appropriate somatic responses [22-24]. The quality of the 
gradient estimator is a crucial factor also in these cases. Hence, it is safe to assume 
that the observed performance advantage of the bCR rules carries over to learning 
scenarios which are much more complex than the ones considered here. 

In this investigation, we have adopted a normative perspective, asking how the 
different variables arising in a complex neuronal model should interact in shaping 
the plasticity response - striving for maximal mathematical transparency and not for 
maximal biological realism. Ultimately, of course, we have to face the question of 
how instructive the obtained results are for modeling biological reality. The question 
has two aspects which we will address in turn: (A) Can the quantities shaping the 
plasticity response be read-out at the synapse? (B) Is the computational structure of 
the rules feasible? 

(A) The global quantities in CR are the timing of somatic spikes as well as the 
value of the somatic potential. The fact that somatic spiking can modulate plasticity 
is well established by STDP experiments (spike timing-dependent plasticity). In fact 
such experiments can also provide phenomenological evidence for the modulation of 
synaptic plasticity by the somatic potential, or at least by a low-pass filtered version 
thereof. The evidence arises from the fact that the synaptic change for multiple spike 
interactions is not a linear superposition of the plasticity found when pairing a single 
pre- synaptic and a somatic spike. Explaining the discrepancy seems to require the 
introduction of the somatic potential as an additional modulating factor [25]. 

In CR-learning, however, we assume that the somatic potential U (Equation 5) 
can differ substantially from a local membrane potential u v (Equation 1) and both 
potentials have to be read-out by a synapse located in the vth dendritic zone. In a 
purely electrophysiological framework, this is nonsensical. The way out is to note 
that what a synapse in CR-learning really needs is to differentiate between the total 
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current flow into the neuron and the flow resulting from AMPA-releases in its local 
dendritic NMDA-zone. While the differential contribution of the two flows is going to 
be indistinguishable in any local potential reading, the difference could conceivably 
be established from the detailed ionic composition giving rise to the local potential 
at the synapse. A second, perhaps more likely, option arises when one considers that 
NMD A- spiking is widely believed to rely on the pre-binding of Glutamate to NMDA- 
receptors [7]. Hence, u v could simply be the level of such NMDA-receptor bound 
Glutamate, whereas U is relatively reliably inferred from the local potential. Such a 
reinterpretation does not change the basic structure of our model, although it might 
require adjusting some of the time constants governing the build up of u v . 

(B) The plasticity rules considered here integrate over the duration T correspond- 
ing to the period during which somatic activity determines eventual reward delivery. 
But synapses are unlikely to know when such a period starts and ends. As in pre- 
vious works [12, 18], this can be addressed by replacing the integral by a low-pass 
filter with a time constant matched to the value of T. The CR-rules, however, when 
evaluating yy(t) to assess the effect of an NMDA- spike, require a second integration 
extending from time t into the future up to t + A. The acausality of integrating into 
the future can be taken care of by time shifting the integration variable in the first 
line of Equation 24, and similarly for Equation 26. But the time shifted rules would 
require each synapse to buffer an impressive number of quantities. Hence, further ap- 
proximations seem unavoidable and, in this regard, the bCR-rule (Equation 26) seem 
particularly promising due to its relatively simple structure. Approximating the hy- 
perbolic tangent in the rule by a linear function yields an update which can be written 
as a proper double integral. This is an important step in obtaining a rule which can be 
implemented by a biologically reasonable cascade of low-pass filters. 

The derivation of the CR-rules presented above builds on previous work on re- 
inforcement learning in a population of spiking point neurons [18, 24, 26]. But in 
contrast to neuronal firings, NMDA-spikes have a non-negligible extended duration 
and this makes the plasticity problem in our complex cell model more involved. The 
previous works introduced a feedback signal about the population decision which 
has a role similar to the somatic feedback in the present CR-rules. A key difference, 
however, is that the population feedback had to be temporally coarse grained since 
possible delivery mechanisms such as changing neurotransmitters levels are slow. In 
a complex cell model, however, a close to instantaneous somatic feedback can be 
assumed. As a consequence, the CR-rules can now support reinforcement learning 
also when the precise timing of somatic action potentials is crucial for reward de- 
livery. Yet, if the soma only integrates NMDA-spikes which extend across 50 ms or 
more, it appears to be difficult to reach a higher temporal precision in the somatic 
firing. In real neurons, the temporal precision is likely to result from the interaction 
of NMDA-spikes with AMPA-releases, with the NMDA-spikes determining periods 
of heightened excitability during which AMPA-releases can easily trigger a precise 
somatic action potential. While important in terms of neuronal functionality, incorpo- 
rating the direct somatic effect of AMPA-releases into the model poses no mathemat- 
ical challenge, just yielding additional plasticity terms similar to the ones for point 
neurons [20]. To focus on the main mathematical issues, we have not considered such 
direct somatic effects here. 
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Appendix 1 

Here, we detail the steps leading from Equation 22 for gf R to Equation 24 for g CR . 

We first obtain a more explicit form for gf R . In view of Equation 22, fiy(tk) = 
$§gg - 1 if y k = 0, whereas ft (ft) = 1 - ™|g if there is NMDA- 
triggering at time tk . Hence, setting 

KY(0 = log ^f^|f wehave ^yto) = ( 2 ^-l)(l-^ fe)(1 " 2 ^) 



and hence 



cr 



9 

(7, Z) = *(Z) Vte - ti)(2y k - 1)(1 - ^to) (1 - 2%) ) — log iUy*). 
^-^ v 7 aw 

k—\ 



Further, from Equation 16 

9 



„ logP u; (^ = l)=^fe) + 0(5), 
9 

— logi^O* = 0) = -8faq N efo u « k) 1r(t k ). 
aw 



Hence, taking the limit 8 —> 0, we obtain 



g CR (Y,Z) = R(Z) [ d^ N ^(0((l-/x)(l 



g -^(o )y(0 



-^ NM( 'V(i-^ (0 )), 

equivalent to the first equation in Equation 24. 

We next need an explicit expression for yy(t). Going back to its definition (Equa- 
tion 24) and using Equations 7 and 12 yields 

Mt) = f r (logfe^ sf/(5;Z ' FUW) )^(^) -qse^ s > Z > YU{t]) )ds 
Jo 

- [\log( qs e^ z > Y \^ 
Jo 

= f Ps (U(s; Z,YU {t}) - U(s; Z,Y\ {t})) Z(s) ds 
Jo 

- f % s (e&tf(*z,ru{*>) - e W;Z>Y\M)ds 
Jo 

= [ Psa(VYU[t}(s)-V Y \{t}(s))Z(s)ds 
Jo 

- f T q s e^ sUhase(s;Z \e^ sa ^ Y[J ^ {s) - e ^ Y \^ {s) ) ds. 
Jo 
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We next note that times s outside of the interval [t, t + A] do not contribute to the 
above integrals since ^YU{t}(s) = x I / f\{^(5) for such s. Further, ^YU{t}(s) = 1 for 
s e [t, t + A]. Hence, 



For the term in square brackets we note that, since tyy\{f} ($) is zero or one, e a ^ s — 



e a^Y\{t}(s) = e afc _ (1 _ W Ym (s) + e a ^ Y \{t}(s)) = (^ S - 1)(1 - *7\{r}(.s)). 



which gives the last line of (Equation 24). 



Appendix 2 

Here, we provide the remaining simulation details. 

An input pattern has a duration of T = 500 ms and is made up from 150 fixed 
spike trains chosen independently from a Poisson process with a mean firing rate of 
6 Hz (independent realizations are used for each pattern). We think of the input as 
being generated by an input layer with 150 sites, with each NMDA-zone having a 
50% probability of being connected to one of the sites. Hence, on average a NMDA- 
zone receives 75 input spike trains and 37.5 spike trains are shared between any two 
NMDA-zones. 

A roughly optimized learning rate was used for all tasks and learning rules. 
Roughly, optimized means that the used learning rate ??* yields a performance which 
is better that when using 1.5 77* or 77*/ 1.5. 

In obtaining the learning curves, for each run a moving average of the actual trial 
by trial performance was computed using an exponential filter with time constant 0.1. 
Mean learning curves where subsequently obtained by averaging over 40 runs. The 
exception to this is the single run learning curve in panel 6C. There, subsequently 
to each learning trial, 100 non-learning trials were used for estimating mean perfor- 
mance. 

Initial weights for each run were picked independently from a Gaussian with mean 
and variance equal to 0.5. Euler's method with a time step of 0.2 ms was used for 
numerically integrating the differential equations. 
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